Transcriptome of the coralline alga Calliarthron tuberculosum (Corallinales, Rhodophyta) reveals convergent evolution of a partial lignin biosynthesis pathway

The discovery of lignins in the coralline red alga Calliarthron tuberculosum raised new questions about the deep evolution of lignin biosynthesis. Here we present the transcriptome of C. tuberculosum supported with newly generated genomic data to identify gene candidates from the monolignol biosynthetic pathway using a combination of sequence similarity-based methods. We identified candidates in the monolignol biosynthesis pathway for the genes 4CL, CCR, CAD, CCoAOMT, and CSE but did not identify candidates for PAL, CYP450 (F5H, C3H, C4H), HCT, and COMT. In gene tree analysis, we present evidence that these gene candidates evolved independently from their land plant counterparts, suggesting convergent evolution of a complex multistep lignin biosynthetic pathway in this red algal lineage. Additionally, we provide tools to extract metabolic pathways and genes from the newly generated transcriptomic and genomic datasets. Using these methods, we extracted genes related to sucrose metabolism and calcification. Ultimately, this transcriptome will provide a foundation for further genetic and experimental studies of calcifying red algae.


Introduction
Coralline red algae (Corallinales, Sporolithales, Hapalidiales) are a diverse lineage of calcified seaweeds that play important ecological roles in nearshore ecosystems worldwide: they stabilize coral reefs by creating a calcium carbonate matrix [1][2][3], induce settlement of invertebrate taxa [4][5][6], and contribute to the storage of blue carbon through the creation of biogenic calcium carbonates [7,8]. In recent years, there has been increased global attention paid to coralline algae. Taxonomists are clarifying their vastly underestimated species diversity [9][10][11][12]; ecologists and physiologists are documenting interspecific variation in coralline growth and calcification, particularly in response to climate stress, which may ultimately impact marine communities [13][14][15][16][17]; and evolutionary biologists are examining patterns in coralline trait evolution [18][19][20] and using >100 million-year-old coralline fossils to strengthen modern phylogenies [21][22][23]. The discovery of lignins within cell walls of the coralline species Calliarthron cheilosporioides (Corallinales, Rhodophyta) dramatically changed our perspective on the evolution of lignin biosynthesis [24]. Lignins are complex aromatic polymers predominantly found in the secondary cell walls of plant support tissues [25,26] and were long considered to have evolved when land plants emerged from the oceans, enabling upright growth in air [27]. Among the principal chemical components of wood, lignins in plant secondary cell walls help reinforce tissue mechanical properties, permit hydraulic transport, and increase pathogen resistance [28,29]. In the articulated coralline C. cheilosporioides, lignins were found predominantly within decalcified flexible joints, called genicula [24], that have remarkable biomechanical properties, permitting this articulated coralline species to thrive along wave-battered coastlines [30,31].
Because lignin biosynthesis is physiologically complex and involves several enzymes in the monolignol pathway [32][33][34], Martone et al. [24] proposed that much of the lignin biosynthetic pathway may have predated land plants altogether, evolving in a common ancestor of red and green algae more than one billion years ago. Alternatively, some (or all) of the monolignol biosynthetic pathway may have evolved independently in the embryophyte and rhodophyte lineages. For example, one important enzyme involved in S-lignin production (F5H) evolved independently in lycopods and embryophytes [35,36]. Moreover, candidate genes related to monolignol biosynthesis have since been found in diverse algal lineages such as diatoms, dinoflagellates, haptophytes, cryptophytes, and green and red algae [37], raising questions about how the monolignol pathway may have evolved across such evolutionarily divergent lineages.
Recently, researchers have suggested that genes related to monolignol production radiated and diversified during the transition to land, noting that the presence of multiple enzymes involved in monolignol production are not present in green algae, but begin to appear after the divergence of streptophytes [38]. Moreover, using genomic data, Labeeuw et al. [37] suggested that putative 4CL, CCR, and CAD sequences in Calliarthron tuberculosum may have arisen independently through horizontal gene transfer [37]. These studies begin to suggest a non-orthologous origin of monolignol production in red algae and land plants, but a complete analysis of all currently known genes in the monolignol biosynthetic pathway has not been completed. Labeeuw et al. [37] examined only three of the eight classes of genes involved in monolignol biosynthesis and based their conclusions entirely on genomic data [37]. Until now, questions about monolignol evolution in coralline red algae have largely gone unanswered as transcriptomic and genomic data have mostly been limited to non-coralline red algae (e.g. [39][40][41][42] but see [43,44]).
Here we present a transcriptome of the articulated coralline Calliarthron tuberculosum (a sister species of C. cheilosporioides) to investigate the evolutionary history of monolignol biosynthesis. Additionally, though a complete mitochondrial genome [45] and a draft nuclear genome [46] of C. tuberculosum were previously published, herein we generated a revised nuclear genome assembly using new short-read sequence data to aid validation of transcriptomic reads. Based on comparative analysis of genome and transcriptome data, we identify gene candidates for a putative monolignol biosynthetic pathway in C. tuberculosum and investigate evolutionary relationships of these enzymes with those from other taxonomic groups, including their land plant counterparts. We also provide a list of annotated genes in the C. tuberculosum transcriptome and a simplified method for extracting genes from metabolic pathways. We illustrate the utility of this dataset by extracting gene candidates involved in sucrose metabolism and calcification. This transcriptomic dataset provides a foundation for future studies of coralline algal ecology, physiology, and evolution.

The C. tuberculosum transcriptome is complete and supported by genomic data
Two transcriptomic datasets were generated from Calliarthron thalli: one from whole tissue (calcified intergenicula plus uncalcified genicula; sample I+G/PTM1 in the deposited data) and a second from intergenicular (i.e., calcified) tissue only (sample I/PTM2). Transcriptome sequencing based on RNA-Seq produced 38.8 total Gb of sequence data (17.3 Gb for sample I +G; 21.5 Gb for sample I). Reads were assembled de novo using Trinity. The whole tissue dataset had 172,700,376 total reads and the intergenicular tissue dataset had 215,491,160 total reads with an overall average coverage of 677-fold. A third reference transcriptome combining data from both PTM1 and PTM2 was assembled independently from raw reads, and this combined dataset was used for all subsequent analysis to increase coverage and maximize discovery. The transcriptome data were considered complete based on the recovery of core eukaryotic genes (e.g. 94.5% of CEGMA and 87.8% of BUSCO genes based on TBLASTN; S1A Fig). Genomic sequences were also assembled for C. tuberculosum (S1 Table), but these remain highly fragmented and were used only as additional support to the transcriptome data in subsequent searches below. More than half (18840; 56.6%) of the 33301 transcripts in the reference transcriptome were supported by the genome data (BLASTN, E � 10 -5 ).

The incomplete monolignol biosynthetic pathway in Calliarthron tuberculosum
The combined C. tuberculosum transcriptomic dataset was searched for genes encoding enzymes from the monolignol biosynthetic pathway. The transcriptomic dataset was translated into all six reading frames and queried with a combination of homology-based approaches, including HMMER searches and KEGG based annotations. Closest homologs from Arabidopsis thaliana were also verified (BLASTN, E � 10 -30 ). We identified gene candidates of 4CL, CCR, CAD, CSE, and CCoAOMT, but not HCT, COMT, PAL, TAL, or PTAL (Fig 1). PAL/ TAL/PTAL was considered absent as only fragmented (and no full length) sequences were identified. Evidence for the presence of homologous p450 enzymes (C3H, C3H, and F5H) was weak; as a result, their status was classified as ambiguous (Fig 1). All sequences identified had genomic support (BLASTN, E � 10 -5 ) except for those identified for PAL/TAL/PTAL.
Candidate sequences from C. tuberculosum (bolded as contig_gene_isoform in Figs 2-4) were characterized by comparing key residues with their land plant homologs in multiple sequence alignments. The evolutionary relationships between the identified C. tuberculosum sequences, closely related sequences in additional taxa, and sequences from the broader protein family of their land plant homologs were analyzed in gene trees. Below we describe in detail results for the main biosynthetic enzymes 4CL, CCR, and CAD (Figs 2-4). Descriptions of the other biosynthetic enzymes CCoAMT, CSE, and the cytochrome P450 sequences C3H, C4H, F5H are found in S1 Appendix and S2-S4 Figs.

Identification of 4CL candidates
4CL is an acyl-CoA synthase in the monolignol pathway and a member of the acyl-activating enzyme (AAE) superfamily. 4CL converts p-coumaric acid, caffeic acid, and ferulic acid into their respective hydroxycinnamoyl-CoA thioesters. We identified 11 candidate 4CL-coding transcripts: two based on KEGG analysis and nine additional sequences based on HMMER searches (Fig 2A). A query of these sequences against the A. thaliana proteome returned related proteins within the acyl-activating enzyme superfamily but not the A. thaliana 4CL (S2 Table). Moderate sequence conservation exists in substrate binding and hydroxycinnamate binding residues between 4CL candidates in C. tuberculosum (bolded) and 4CLs in land plants (identity similarity [IS] > 70% Fig 2A).

Identification of CCR candidates
CCR is the first committed enzyme in the monolignol pathway, reducing cinnamoyl-CoA esters to cinnamaldehydes. We identified three sequences as candidate CCR-coding transcripts: one based on KEGG analysis and two additional sequences based on HMMER searches ( Fig 3A). A query of these sequences against the A. thaliana proteome returned sequences within the CCR family (CCR7, CCR4, CCR-Like6) (S2 Table). Substrate-binding residues (NWYCY) and the hydroxycinnamonyl-binding pocket showed low sequence conservation (IS <80%). In contrast, the core catalytic residues (S, T, and K) and NADPH-binding residues appear to be conserved (IS >90%) between the candidate sequences in C. tuberculosum and CCRs in land plants ( Fig 3A).
In the CCR gene tree analysis, C. tuberculosum sequences varied in their relatedness to other taxa with some sequences closer to Rhodophytes and others more closely related to Oomycota/Mycetozoa/Fungi ( Fig 3B). Additionally, CCR candidates in C. tuberculosum were mapped with epimerase dehydratase type sequences that included the A. thaliana CCR family ( Fig 3B). Sequences from C. tuberculosum grouped with epimerase dehydratase type sequences of non-embryophyte origin. In contrast, embryophyte CCR, class 2 CCR, and CCR-like form an independent clade (BS >97%). The embryophyte CCR clade and the non-embryophyte epimerase dehydratase clade (containing sequences from C. tuberculosum) were more closely related than the embryophyte dihydroflavonol-4-reductase protein (DFR) group within the overall epimerase dehydratase family.

Identification of CAD candidates
CAD, the final step in the monolignol pathway, is an alcohol dehydrogenase converting various hydroxycinnamaldehydes to their respective hydroxycinnamyl alcohols. SAD, proposed to catalyze this same reaction for sinapyl monolignols [60], is added into our analysis despite debate over their function. We identified five sequences as candidate CAD-encoding transcripts: two based on KEGG analysis and three additional sequences based on HMMER searches ( Fig 4A). A query of these sequences against the A. thaliana proteome returned  [47,48]. Phenylalanine substrate binding pocket [49] is indicated with Box I and Box II. (B) Maximum likelihood acyl-activating enzyme (AAE) gene tree showing relationships between Calliarthron sequences (magenta dots) and other taxa (Embryophyta-dark green, Chlorophyta-light green, Rhodophyta-red, Animalia and Opisthokonta-purple, Bacteria and Cyanobacteria-blue, Oomycota, Mycetozoa and Fungi-yellow, Ochrophyta-brown). Functionally demonstrated plant 4CLs are labelled (+). Additional functional groups are labelled [50,51]. Ultrafast bootstrap values > 95 are marked by � . Model = WAG+F+G4. Sites with � 80% occupancy were removed. Accession numbers can be found in S1 Appendix.
In the CAD gene tree analysis, all C. tuberculosum sequences grouped with sequences from other Rhodophytes ( Fig 4B). CAD candidates in C. tuberculosum were mapped with their embryophyte CAD counterparts and closely related alcohol dehydrogenases. Sequences from C. tuberculosum grouped together with oxidoreductases (BS = 100%), sorbitol dehydrogenases (BS = 100%), general alcohol dehydrogenases (BS = 100%), and an algal CAD clade (BS = 100%). Sequences in this algal CAD clade were based on previous sequence similaritybased annotation and have not been functionally demonstrated. In contrast, the land plant CAD and SAD sequences form their own clades (BS 100%; Fig 4B) that are separated from the C. tuberculosum candidates by the functionally distinct alcohol dehydrogenases, such as yeast alcohol dehydrogenase 7 (ADH7) and E. coli aldehyde reductase (YAHK).

Identification of additional metabolic pathways in Calliarthron tuberculosum
To enable broad and rapid identification of C. tuberculosum genes involved in specific metabolic processes, we present two general tools for gene identification within the C. tuberculosum transcriptome dataset using KEGG based annotations. This involves extracting whole metabolic pathways or individual genes (see S1 Appendix and S5 Fig). We included annotations for all metabolic genes recovered in the C. tuberculosum transcriptome (S3 Table). We identified 36 putative C. tuberculosum genes present in the starch and sucrose metabolism pathway (S5  [52] and additional residues are indicated above with a black box. NADPH binding pocket residues are indicated with black triangles [53] and the GXXGXX[A/G] motif is underlined [54]. Hydroxycinnamonyl binding pocket residues are indicated with a gray triangle [53]. (B) CCR maximum likelihood gene tree showing relationships between C. tuberculosum (magenta dots) and other taxa (Embryophyta-dark green, Chlorophyta-light green, Rhodophyta-red, Animalia and Opisthokonta-purple, Bacteria and Cyanobacteria-blue, Oomycota, Mycetozoa and Fungi-yellow, Ochrophyta-brown). Functionally demonstrated plant CCRs are labelled (+). Additional functional groups are labelled. Ultrafast bootstrap values >95 are marked by � . Model = LG+G4. Sites with � 80% occupancy were removed. Accession numbers can be found in S1 Appendix.  Table). In addition, we individually searched for genes potentially involved in calcification [43,61,62] and identified 13 sequence candidates related to calcium transport, six related to inorganic carbon transport, five related to pH homeostasis, 19 putative carbonic anhydrases, and 12 putative HSP90 genes (S5 Table).

Evidence for convergent evolution of monolignol biosynthesis
Using sequence similarity methods with genes from the monolignol pathway in land plants, we identified candidates for five genes related to monolignol biosynthesis (4CL, CCR, CAD, CCoAOMT, and CSE) from the newly generated C. tuberculosum transcriptomic dataset. These gene candidates are supported by genomic evidence, retain major motifs from their respective gene family, and return their A. thaliana counterpart in reciprocal BLAST analyses, suggesting that these enzymes may function similarly in monolignol biosynthesis in C. tuberculosum.
Despite supporting evidence from sequence similarity analyses, functional predictions for candidate sequences in the monolignol pathway within C. tuberculosum are obscured by the gene tree analysis. If the monolignol pathway in embryophytes and C. tuberculosum evolved in a common ancestor and was retained through conserved evolution, we would expect their sequences to form functional clades uninterrupted by functionally divergent protein sequences. However, with the exception of the CCoAOMT candidate, our gene tree analyses consistently showed that monolignol biosynthetic genes in land plants are not sister to those in C. tuberculosum. C. tuberculosum sequences were found within each respective overall protein family, but consistently grouped with land plant genes of non-monolignol forming function. If these C. tuberculosum sequences are functionally homologous to the monolignol biosynthesis Partial alignment of C. tuberculosum CAD sequence candidates (bolded) with land plant CAD sequences. Zn +2 ion coordinating and proton shuttling residues are indicated with the black triangle, NADPH or NADH interacting residues are boxed. Hydrostatic interaction forming residues are indicated with a black box. Putative substrate-binding residues are indicated with grey boxes [55][56][57]. (B) CAD maximum likelihood gene tree showing relationships between C. tuberculosum (magenta dots) and other taxa (Embryophyta-dark green, Chlorophyta-light green, Rhodophytared, Animalia and Opisthokonta-purple, Bacteria and Cyanobacteria-blue, Oomycota, Mycetozoa and Fungi-yellow, Ochrophyta-brown). Alcohol dehydrogenase (ADH) sequences from yeast, and aldehyde reductase (YAHK and AHR) sequences from E. coli were used as the ADH family is closely related to that of CAD [58,59]. Functionally demonstrated plant CADs are labelled (+). Additional functional groups are labelled. Ultrafast bootstrap values >95 are marked by � . Model = LG+G4. Sites with � 80% occupancy were removed. Accession numbers can be found in S1 Appendix.
https://doi.org/10.1371/journal.pone.0266892.g004 counterpart in land plants, then they likely arose independently in C. tuberculosum. Convergent evolution in protein function, with phylogenetic patterns of protein sequences with similar functions intersected by sequences with dissimilar functions, is not uncommon in cell wall synthesizing enzymes [63]. Biosynthetic enzymes in C. tuberculosum could have evolved similar substrate specificity after the divergence of red algae and land plants or, alternatively, may reflect genes that were individually acquired. Previous evidence suggests that the core monolignol biosynthesis genes (4CL, CCR, and CAD) in C. tuberculosum may have been acquired through horizontal gene transfer from a bacterial source [36]. Thus, over evolutionary time genes in C. tuberculosum may have developed enough synchronicity in gene expression and protein regulation to produce an ad hoc monolignol biosynthetic pathway.
Alternatively, the phylogenetic evidence might suggest that gene candidates in C. tuberculosum do not function in monolignol biosynthesis and instead have a function similar to their sister sequences within their distinct phylogenetic groupings. For example, considering only clustering patterns in the phylogenetic data, perhaps C. tuberculosum contig 141618 functions as a CoA ligase that acts on malonate and not coumarate (4CL enzyme) (Fig 2B). However, the tandem use of stricter curated sequences in our predictive HMM models and more flexible HMM models with previously annotated data, such as KEGG annotations, improves our confidence in finding potential gene candidates. Biochemical or functional assays will ultimately be needed to verify the function of candidate gene sequences.

The monolignol biosynthesis pathway and missing steps in Calliarthron tuberculosum
Several key steps in the monolignol biosynthetic pathway were not recovered in the C. tuberculosum transcriptome, including PAL, TAL, PTAL, HCT, COMT, C3H, C4H, or F5H. Although we cannot dismiss that these observations may be due to fragmented sequences in the assembled genome and transcriptome data, we present several other possibilities.
The ammonia-lyase PAL, TAL, or PTAL creates the first substrates in the monolignol biosynthetic pathway [64][65][66]. Although no full-length homologs were identified in the C. tuberculosum transcriptome, short sequence candidates identified may represent a fragmented gene. However, these short sequences lacked genomic support, indicating they may be contaminants of non-Calliarthron origin. For this reason, PAL, TAL, and PTAL are currently indicated as absent (Fig 1). If these are indeed from C. tuberculosum, RACE amplification could help determine if the short ammonia-lyase we identified has a longer transcript. C. tuberculosum likely has an ammonia-lyase acting on phenylalanine or tyrosine since PAL and TAL are also key enzymes in producing flavanoids and coumarins, which have been previously detected in both fleshy and coralline red algae [67]. Further validation will be required to elucidate their presence. C3H, C4H, or F5H are p450 monooxygenases responsible for converting substrates across the monolignol pathway eventually resulting in H to S to G type monolignols, respectively (Fig  1). P450 sequence candidates have been identified, but their substrate-specific identity as C3H, C4H, or F5H homologs is unclear. The cytochrome P450 sequence candidates from the C. tuberculosum transcriptome form two divergent groups. One group is likely involved in carotenoid biosynthesis, positioned within the CYP97 clade, while the other group forms their own clade of unknown function (S2B Fig). The identified candidates from C. tuberculosum may have multi-substrate specificities, acting on various substrates, including monolignol intermediate products. Some substrate promiscuity has previously been observed within members of the cytochrome P450 enzyme family [68,69]. Alternatively, each of the identified P450 clades in C. tuberculosum could contain a new class of cytochrome P450 capable of functioning in H-, G-, or S-unit monolignol biosynthesis. This proposed convergent evolution of a distinct and independently-evolved cytochrome P450 involved in monolignol production has previously been documented in the clubmoss Selaginella moellendorffii (F5H) [35,36]. In any case, the presence of unique P450s represents an interesting avenue of exploration to elucidate substrate specificity and functionality in the monolignol pathway in C. tuberculosum.
HCT is one alternative route shifting monolignol synthesis from H-to G-to S-types using a temporary shikimate decoration (Fig 1) [70]. Its absence could suggest that C. tuberculosum does not utilize an HCT enzyme or create G lignin using this route. Another alternative route in G-and S-type monolignol synthesis utilizes a CSE enzyme that acts on caffeoyl shikimate, an HCT downstream product (Fig 1). The absence of an HCT is at odds with the CSE enzyme identified in this study (Fig 1), suggesting that the CSE candidate identified may not be utilized in the monolignol biosynthetic pathway for C. tuberculosum. Though this absence could be due to fragmentation in the transcriptome, more data are required for further validation.
COMT is necessary for S type monolignol production in angiosperms [71][72][73]. The absence of this enzyme raises questions about how C. tuberculosum can produce sinapyl alcohol, a precursor component for S monolignols. Some evidence exists for a bifunctional enzyme in pine that can function as both COMT and CCoAOMT (named AEOMT) in heterologous systems [74]. However, only moderate-to-low sequence similarity is shared among CCoAOMT, COMT, and the bifunctional AEOMT. Perhaps a similar protein with broad substrate specificity is present in C. tuberculosum but has yet to be identified based on sequence similarity.

Conclusion
In summary, we have identified several gene candidates in the C. tuberculosum transcriptome that represent possible components in the monolignol biosynthetic pathway, helping to explain the surprising presence of lignins in this coralline red alga. Despite the complexity of monolignol biosynthesis, and contrary to the predictions outlined in Martone et al. [24], our gene trees do not demonstrate a deeply conserved evolution of monolignol biosynthesis, but instead suggest that each of the enzymes identified in C. tuberculosum likely evolved independently from those found in land plants. Interestingly, there remain several key enzymes in the monolignol pathway whose sequences have not been identified, including those related to pathway entry and to shifting the types of monolignols produced that would form H-, G-, and S-lignins within the cell wall. Further biochemical evidence and validation of sequence expression will be necessary to provide functional support for both the genes identified and to elucidate potential alternative routes in the monolignol biosynthetic pathway in C. tuberculosum. By providing methods to easily identify additional gene candidates from the C. tuberculosum transcriptome, we aim to facilitate future research on this fascinating organism.

Specimen collection and sequencing
Two male, haploid specimens of Calliarthron tuberculosum were collected October 6, 2013, from Bluestone Point (48.81952, -125.1640), Bamfield, British Columbia, Canada. Samples were verified as haploid male specimens by identifying spermatangial conceptacles under a dissecting microscope. A portion of each collected sample was pressed and deposited into the UBC herbarium with voucher codes A89970 and A89985. Voucher codes can be queried at https://herbweb.botany.ubc.ca/herbarium/search.php?Database=algae for more information.
Samples were processed as either whole tissue containing calcified intergenicula and noncalcified genicula (Sample I+G/PTM1 in the dataset) or calcified tissue only (Sample I/PTM2 in the dataset). To reduce epiphyte contamination, samples used for sequencing were collected from newer growth at the tips where epiphytes had not yet settled. On site, samples were brushed, rinsed with ethanol, dissected for regions with both calcified intergenicula and noncalcified genicula or calcified intergenicula only. Fragments were observed under a dissecting microscope for the absence of epiphytes and then flash frozen in liquid nitrogen and stored at -80˚C until use. To extract RNA, samples were ground to a powder in a sterile pre-chilled mortar and pestle. The mortar was placed on dry ice and liquid nitrogen was added into the mortar to maintain sample temperatures at -80˚C. RNA was extracted from 70 mg-80 mg of ground sample using the Spectrum Plant Total RNA kit (STRN50, Sigma-Aldrich) with the following modifications. Tubes containing ground material and lysis solution were centrifuged for 10 minutes at 14000 g to pellet debris. Protocol A was used to bind RNA to the column. 750 μl of binding solution was used. The elutions were performed with 40 μl of pre-warmed 56˚C RNAse-free water. DNA was removed from 1200ng of RNA samples between 50 ng/μL and 400 ng/μL in concentration with DNase (AM1906, ThermoFisher). RNA was then converted into cDNA, and quality and yields were verified using Bioanalyzer 2100. Samples of each type were then pooled (i.e. all samples containing both calcified and uncalcified segments were pooled and samples containing calcified segments only were pooled). RNA was then sequenced on the Illumina HiSeq 2000 platform (paired-end 2x100bp, insert size~220bp) at the UBC Sequencing + Bioinformatics Consortium.

Transcriptome assembly and annotation
Illumina sequence reads were assembled using Trinity with the de novo mode at default setting [75], independently for each anatomical sample (I+G/PTM1; I/PTM2 in the ENA database). A reference transcriptome was also assembled de novo using Trinity by independently combining the sequence reads generated from both samples. The assembled transcripts were annotated using Blast2GO [76]. Briefly, each transcript was searched against the NCBI RefSeq protein database (BLASTX, E � 10 -5 ), and its putative function was inferred based on the top protein hit and Gene Ontology (GO) terms. These proteins were then mapped onto the corresponding metabolic pathways in the Kyoto Encyclopaedia of Gene and Genomes (KEGG) database [77]. Identification of genes present in KEGG annotated pathways were extracted using the pathview package [78].

Filtering contaminant sequences in genome assembled data
To identify putative contaminant sequences in the genome assembly, each genome scaffold was searched (BLASTN) against a database of archaeal, bacterial and viral genome sequences retrieved from the NCBI RefSeq database. Sequences with a significant hit (E � 10 −5 , covering > 50% of the query length) were considered putative contaminants and removed from the genome assembly. To identify broad differences in sequence characteristics, genomic scaffolds with and without transcriptomic support were compared for G+C content and transcript length (S1B Fig). Scaffolds with no transcript support and low recovery of eukaryotic genes (< 6% BUSCO or CEGMA recovery) were also identified as likely putative contaminants and removed from the genome assembly.

Genome annotation guided by transcriptome evidence
Repetitive elements in the genome assembly were identified and masked using RepeatMasker version open-4.0.6 [79]. To maximize recovery of transcript support for genome scaffolds, the transcriptomes (I+G/PTM1; I/PTM2 in the dataset) were mapped against the masked genome scaffolds using PASA v2.0.2 [80], and full-length coding sequences (CDSs) were predicted with TransDecoder v5.0.1 [75]. These CDSs represent the primary set of putative genes and were used as extrinsic hints to guide ab initio gene prediction using AUGUSTUS v3.2.1 [81] from the genome scaffolds.

HMM based gene candidate search
Monolignol biosynthesis gene candidates were identified from the C. tuberculosum transcriptomic dataset using Hidden Markov Model (HMM) based searches [82]. Transcriptomic sequence contigs were translated into all six reading frames using EMBOSS Transeq [83]. This amino acid database was used for subsequent sequence searches. HMM profiles used to search for homologs in the transcriptome were produced by aligning amino acid sequences of a given protein or protein family using MUSCLE [84] with no manual adjustment. The profiles were searched against the translated C. tuberculosum dataset in HMMER searches [82] to look for putative sequence homologs. Sequences more than 100 amino acids long were retained for subsequent analysis. These sequences were then searched against the Arabidopsis (GenBank taxid:3701) proteome using NCBI's BLAST [85] to verify their closest homolog match (BLASTP, E � 10 -30 ).

Domain and motif comparison
The monolignol biosynthetic genes and their overall gene families contain sequence domains that influence protein shape and function. To compare these key domains, multiple sequence alignments (MSA) of candidate amino acid sequences from C. tuberculosum with their land plant counterpart protein were produced. Sequences were aligned using MUSCLE under default settings [84]. Key domains and motifs were chosen based on available literature and highlighted in the MSA as indicated in each figure legend. In each MSA, an asterisk ( � ) represents full conservation; and a period (.) represents sites with conservation >50%. Accession numbers can be found in S1 Appendix.

Gene tree analysis
Gene trees were reconstructed for the identified candidate sequences of C. tuberculosum. For each gene tree analysis, sequence candidates from C. tuberculosum, the functionally demonstrated enzyme sequence from land plants, enzyme sequences from the overall protein family from land plants, and the top 20 sequences identified by NCBI BLAST using C. tuberculosum candidates as a query against the total database using default settings (BLASTP, E � 10 -20 ) were compiled. Land plant sequences identified to represent the functional gene and overall gene family were curated by a literature search. For each set of sequences, a multiple sequence alignment was performed using MUSCLE with default setting [84]. Sites with <80% coverage were removed using trimAl [86]. IQTree was used to search for the evolutionary model alignment under a BIC criterion [87,88]. A maximum likelihood tree was reconstructed using IQTree [89], with node support calculated based on 1000 ultrafast bootstrap pseudoreplicates in IQTree [89]. A clade is considered strongly supported when bootstrap value � 95%. FigTree was used to edit branch width and colors [90]. Species in the phylogenies are colored as followed Embryophyta (including both vascular and non-vascular plant species)-dark green, Chlorophyta-light green, Rhodophyta-red, Animalia and Opisthokonta-purple, Bacteria and Cyanobacteria-blue, Oomycota, Mycetozoa and Fungi-yellow, Ochrophyta-brown. Accession numbers can be found in S1 Appendix.

Generation of genome data as additional support for transcriptome data
Genome data of C. tuberculosum were generated using Illumina IIx platform (paired-end 2×150bp reads, insert size~350 bp). An overview of the summary statistics for the genome assembly can be found in S1 Table. Adapter sequences were removed using Trimmomatic v0.33 [91] (LEADING:25 TRAILING:25 HEADCROP:10 SLIDINGWINDOW:4:20 MIN-LEN:50). The generated filtered sequence reads and the previously published genome data (GenBank accession #: SRP005182) generated using the 454 pyrosequencing platform [46] were used in a de novo genome assembly using SPAdes [92]. The 454 reads were treated as unpaired, single-end reads in the assembly process. This de novo assembly was further scaffolded with the transcriptome data using the L_RNA_Scaffolder [93]. Putative contaminant sequences were removed based on shared similarity against known genome sequences from bacterial, archaeal, and viral sources in NCBI RefSeq (BLASTN, E � 10 -5 ), and subsequently based on discrepancy in G+C content of the assembled scaffolds, and the recovery of core eukaryotic genes (CEGMA and BUSCO). Because the genome assembly is fragmented, genome scaffolds on which no transcripts were mapped were filtered out, yielding the final genome assembly (21,672 scaffolds, total bases 64.15 Mbp). These genome scaffolds were used as additional support for the transcriptome data. For the reference transcriptome (combined I +G/PTM1; I/PTM2), putative coding sequences were predicted based on alignment of the assembled transcripts against the genome scaffolds using PASA [80] and TransDecoder [75], from which the coded protein sequences were predicted.

Completeness of transcriptome and genome data
The completeness of the genome and transcriptome data were assessed by the recovery of core conserved eukaryote genes with the Core Eukaryotic Genes Mapping Approach (CEGMA) [94] and Benchmarking Universal Single-Copy Orthologs (BUSCO) [95] datasets. CEGMA and BUSCO datasets (eukaryote odb9 and Viridiplantae odb10) were independently used as query to search against the predicted proteins from the reference transcriptome (combined IG and IO) using BLASTP (E � 10 -5 ) and against the same transcriptome using TBLASTN (E � 10 -5 ). The core CEGMA and BUSCO proteins were also queried against the 21,672 genome scaffolds using TBLASTN (E � 10 -5 ).

Key resources
A summary of key resources used in this study, including sample identifiers, reagents, online resources, and programs can be found in S6 Table. Supporting information S1 Appendix. Supplemental methods and results. Genome reassembly and removal of contamination; analysis of completeness for genomic and transcriptomic datasets; E-value thresholds used in gene candidate identification; accession numbers for all sequences used; identification of cytochrome P450s sequences (C3H, C4H, and F5H), CCoAOMT, and CSE; and the benefits and limitations of using KEGG mapping for gene annotation of Calliarthron sequences. (DOCX) S1 Fig. Completeness of the C. tuberculosum transcriptome dataset. (A) Transcriptome sequences show high recovery of eukaryotic genes in CEGMA/BUSCO analysis. Percentage of genomic scaffolds with transcriptome support and transcriptomic scaffolds alone that share amino acid sequences with the core eukaryotic gene databases including CEGMA, BUSCO eukaryotic, and BUSCO Viridiplantae. Transcriptome encoded amino acid sequences were searched against the databases using BLASTP (orange) or TBLASTN (yellow), and genomic scaffolds were searched against the databases using TBLASTN (blue). (B) Transcriptomic support of genomic data analyzed by GC content and transcript length. The distribution of GC content (above) against transcript lengths is shown for scaffolds with transcriptome support (blue) and scaffolds without transcriptome support (yellow) (right). (TIF) S2 Fig. C3H, C4H, F5H, P450 candidates from C. tuberculosum in relation to plants and other taxa. (A) Partial alignment of C. tuberculosum P450 candidates with C3H, C4H, and F5H from A. thaliana, and a novel F5H from Selaginella moellendorffii. Heme binding domain residues, secondary structure stabilizing K helix residues, PXRX, and the I-helix are indicated [8]. Sites with <80% coverage were removed. A strong candidate for beta-carotene synthesis is indicated with a triangle. (B) Unrooted CYP450 maximum likelihood gene tree with C. tuberculosum (magenta dots) and additional taxa (Embryophyta-dark green, Chlorophyta-light green, Rhodophyta-red, Animalia and Opisthokonta-purple, Bacteria and Cyanobacteriablue, Oomycota, Mycetozoa and Fungi-yellow, Ochrophyta-brown). Functionally demonstrated plant C3H, C4H, and F5H are labeled (+). Additional functional groups are labeled [9]. showing the starch and sucrose metabolic pathway with C. tuberculosum annotations highlighted. The gradient map in the top right corner indicates the level of transcription, with white and dark pink coloring representing absence and presence of expression respectively. The annotated map, number "00500", was extracted in the provided R file using the pathview program. (TIF) S1 Table. Summary statistics for the C. tuberculosum genome assembly. Scaffolds are categorized as shared with either red algal (Pyropia yezoensis) genomic scaffolds, eukaryotic sequences, or other bacteria sequences based on sequence similarity. (XLSX) S2 Table. Top hits against Arabidopsis thaliana (taxid:3702) using Calliarthron sequences as the search query (BLASTP). Query sequence is indicated by contig number. Result hits are indicated by description (At tax ID 3702) and colored by overall alignment scores with red (> = 200), pink , green , blue (40)(41)(42)(43)(44)(45)(46)(47)(48)(49)(50), and black (<40) that are most to least reliable scores in that order. (XLSX) S3 Table. KEGG annotations of Calliarthron tuberculosum reads from the combined transcriptomic dataset. Unique reads are represented by their contig identifier (contig_gene_isoform) and matched with their annotated KEGG based identifier (KO_identifier) and associated protein name. (XLSX) S4 Table. Listed representation of the C. tuberculosum sequences present in starch and sucrose metabolism pathway from the KEGG based annotation. C. tuberculosum sequences were extracted from the KEGG based starch and sucrose metabolism pathway number "00500". "KEGG Identifier" refers to the specific KEGG code for the gene, "Contig Name" refers to the sequence identifier from the Calliarthron transcriptome representing the contig name_gene number_gene isoform and "Gene Name" refers to the gene acronym, the gene name, and its enzyme commission (EC) number. Sequences were extracted in the provided R file using the pathview program, accessible from the Github link provided. (XLSX) S5 Table. A list of calcification related gene candidates identified from KEGG-based annotations of the C. tuberculosum transcriptome. Calcification gene candidates were initially selected based on a literature search, and additional C. tuberculosum sequences were identified manually from the KEGG based annotations (annotation file available on the Github link), thus this list is not exhaustive. The genes are organized by their functional classification indicated as "overall function". "KEGG Identifier" refers to the specific KEGG code for the gene, "Contig Name" refers to the sequence identifier from the Calliarthron transcriptome representing the contig name_gene number_gene isoform and "Gene Name" refers to the gene acronym, the gene name, and its enzyme commission (EC) number. (XLSX) S6